Neural Empirical Bayes
Causal Representation Learning Made Identifiable by Grouping of Observational Variables
A topic of great current interest is Causal Representation Learning (CRL),
whose goal is to learn a causal model for hidden features in a data-driven
manner. Unfortunately, CRL is severely ill-posed since it is a combination of
the two notoriously ill-posed problems of representation learning and causal
discovery. Yet, finding practical identifiability conditions that guarantee a
unique solution is crucial for its practical applicability. Most approaches so
far have been based on assumptions on the latent causal mechanisms, such as
temporal causality or the existence of supervision or interventions; these can be
too restrictive in actual applications. Here, we show identifiability based on
novel, weak constraints which require no temporal structure, interventions,
or weak supervision. The approach is based on assuming that the observational
mixing exhibits a suitable grouping of the observational variables. We also propose a
novel self-supervised estimation framework consistent with the model, prove its
statistical consistency, and experimentally show its superior CRL performance
compared to state-of-the-art baselines. We further demonstrate its
robustness against latent confounders and causal cycles.
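To make the grouping assumption concrete, below is a minimal NumPy sketch of one way such data could arise: each disjoint group of observed variables is a nonlinear mixture of its own latent variable, and the latents are causally related across groups. The group sizes, mixing functions, and the group-shuffling contrastive task are illustrative assumptions, not the paper's exact model or estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Latent causal model across groups (illustrative): z1 -> z2,
# with non-Gaussian noise.
z1 = rng.laplace(size=n)
z2 = 0.8 * z1 + 0.5 * rng.laplace(size=n)

def mix(z, w):
    # Nonlinear within-group mixing of one latent into a 3-D group.
    return np.tanh(np.outer(z, w)) + 0.1 * np.outer(z, np.ones_like(w))

x_group1 = mix(z1, np.array([1.0, -0.7, 0.3]))    # group 1 observes z1 only
x_group2 = mix(z2, np.array([0.5, 1.2, -0.4]))    # group 2 observes z2 only
x = np.concatenate([x_group1, x_group2], axis=1)  # what the learner sees

# One plausible self-supervised task (an assumption, not the paper's
# exact algorithm): discriminate real group pairs from group-shuffled
# negatives, which forces a classifier to model the cross-group
# dependence induced by the latent causal links.
perm = rng.permutation(n)
negatives = np.concatenate([x_group1, x_group2[perm]], axis=1)
```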
A mixture of sparse coding models explaining properties of face neurons related to holistic and parts-based processing
Experimental studies have revealed evidence of both parts-based and holistic representations of objects and faces in the primate visual system. However, it is still a mystery how such seemingly contradictory types of processing can coexist within a single system. Here, we propose a novel theory called mixture of sparse coding models, inspired by the formation of category-specific subregions in the inferotemporal (IT) cortex. We developed a hierarchical network that constructed a mixture of two sparse coding submodels on top of a simple Gabor analysis. The submodels were each trained with face or non-face object images, which resulted in separate representations of facial parts and object parts. Importantly, evoked neural activities were modeled by Bayesian inference, which had a top-down explaining-away effect that enabled recognition of an individual part to depend strongly on the category of the whole input. We show that this explaining-away effect was indeed crucial for the units in the face submodel to exhibit significant selectivity to face images over object images in a similar way to actual face-selective neurons in the macaque IT cortex. Furthermore, the model explained, qualitatively and quantitatively, several tuning properties for facial features found in the middle patch of face processing in IT, as documented by Freiwald, Tsao, and Livingstone (2009). These included, in particular, tuning to only a small number of facial features that were often related to geometrically large parts like face outline and hair, preference and anti-preference of extreme facial features (e.g., very large/small inter-eye distance), and reduction of the gain of feature tuning for partial face stimuli compared to whole face stimuli. Thus, we hypothesize that the coding principle of facial features in the middle patch of face processing in the macaque IT cortex may be closely related to a mixture of sparse coding models.
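As a rough illustration of the inference idea, the sketch below builds a two-component mixture of sparse coding models and shows the explaining-away effect: each submodel's unit responses are gated by the posterior over the category of the whole input. The dictionary sizes, the ISTA solver, and the softmax gating are illustrative assumptions, not the authors' trained hierarchical network.

```python
import numpy as np

rng = np.random.default_rng(0)
D_face = rng.normal(size=(64, 100)); D_face /= np.linalg.norm(D_face, axis=0)
D_obj  = rng.normal(size=(64, 100)); D_obj  /= np.linalg.norm(D_obj, axis=0)

def ista(x, D, lam=0.1, n_iter=200):
    """Sparse code a = argmin_a 0.5*||x - D a||^2 + lam*||a||_1 via ISTA."""
    L = np.linalg.norm(D, 2) ** 2                # Lipschitz const. of gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        a = a - D.T @ (D @ a - x) / L            # gradient step
        a = np.sign(a) * np.maximum(np.abs(a) - lam / L, 0.0)  # soft-threshold
    return a

def infer(x, lam=0.1):
    """MAP-style inference in a two-submodel mixture of sparse coding models."""
    codes, energies = [], []
    for D in (D_face, D_obj):
        a = ista(x, D, lam)
        codes.append(a)
        energies.append(0.5 * np.sum((x - D @ a) ** 2) + lam * np.abs(a).sum())
    en = np.array(energies)
    p = np.exp(-(en - en.min())); p /= p.sum()   # posterior over category
    # Explaining away: each submodel's unit responses are gated by the
    # category posterior, so a "face part" unit is suppressed whenever the
    # whole input is better explained as a non-face object.
    return p, [p[k] * codes[k] for k in range(2)]

p, (r_face, r_obj) = infer(rng.normal(size=64))
```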
Density Estimation in Infinite Dimensional Exponential Families
In this paper, we consider an infinite dimensional exponential family $\mathcal{P}$
of probability densities, which are parametrized by functions in
a reproducing kernel Hilbert space $H$, and show it to be quite rich in the
sense that a broad class of densities on $\mathbb{R}^d$ can be approximated
arbitrarily well in Kullback-Leibler (KL) divergence by elements in
$\mathcal{P}$. The main goal of the paper is to estimate an unknown density $p_0$
through an element in $\mathcal{P}$. Standard techniques like maximum
likelihood estimation (MLE) or pseudo MLE (based on the method of sieves),
which are based on minimizing the KL divergence between $p_0$ and
$\mathcal{P}$, do not yield practically useful estimators because of their
inability to efficiently handle the log-partition function. Instead, we propose
an estimator $\hat{p}_n$ based on minimizing the \emph{Fisher divergence}
$J(p_0\Vert p)$ between $p_0$ and $p\in\mathcal{P}$, which involves solving a
simple finite-dimensional linear system. When $p_0\in\mathcal{P}$, we show that
the proposed estimator is consistent, and provide a convergence rate of
$n^{-\min\{\frac{2}{3},\frac{2\beta+1}{2\beta+2}\}}$ in Fisher
divergence under the smoothness assumption that $\log p_0\in\mathcal{R}(C^\beta)$
for some $\beta\ge 0$, where $C$ is a certain
Hilbert-Schmidt operator on $H$ and $\mathcal{R}(C^\beta)$ denotes the image of
$C^\beta$. We also investigate the misspecified case of $p_0\notin\mathcal{P}$
and show that $J(p_0\Vert\hat{p}_n)\to\inf_{p\in\mathcal{P}}J(p_0\Vert p)$ as
$n\to\infty$, and provide a rate for this convergence under a
similar smoothness condition as above. Through numerical simulations we
demonstrate that the proposed estimator outperforms the non-parametric kernel
density estimator, and that the advantage with the proposed estimator grows as
$d$ increases.
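The computational point, that minimizing the Fisher divergence over an exponential family sidesteps the log-partition function and reduces to a linear system, can be seen in a finite-dimensional analogue. The sketch below fits $p_\theta(x)\propto\exp(\theta_1 x+\theta_2 x^2)$ to data by score matching; the polynomial sufficient statistics are an illustrative stand-in for the paper's RKHS parametrization, not its actual estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=1.5, scale=0.7, size=5000)  # unknown density to estimate

# Model: log p_theta(x) = theta1*x + theta2*x^2 - log Z(theta).
# The score psi(x) = d/dx log p_theta(x) = theta1 + 2*theta2*x does not
# involve log Z, and the score matching objective
# J(theta) = E[0.5*psi(x)^2 + psi'(x)] is quadratic in theta, so its
# minimizer solves the linear system A theta = -b:
T1 = np.stack([np.ones_like(x), 2 * x])  # dT/dx for T(x) = (x, x^2)
A = T1 @ T1.T / len(x)                   # E[T'(x) T'(x)^T]
b = np.array([0.0, 2.0])                 # E[T''(x)]
theta = np.linalg.solve(A, -b)

# For Gaussian data this recovers theta2 = -1/(2 sigma^2), theta1 = mu/sigma^2.
sigma2_hat = -1.0 / (2 * theta[1])
mu_hat = theta[0] * sigma2_hat
print(mu_hat, sigma2_hat)                # approx 1.5 and 0.49
```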
The Optimal Noise in Noise-Contrastive Learning Is Not What You Think
Learning a parametric model of a data distribution is a well-known statistical problem that has seen renewed interest as it is brought to scale in deep learning. Framing the problem as a self-supervised task, where data samples are discriminated from noise samples, is at the core of state-of-the-art methods, beginning with Noise-Contrastive Estimation (NCE). Yet, such contrastive learning requires a good noise distribution, which is hard to specify; domain-specific heuristics are therefore widely used. While a comprehensive theory is missing, it is widely assumed that the optimal noise should in practice be made equal to the data, both in distribution and proportion; this setting underlies Generative Adversarial Networks (GANs) in particular. Here, we empirically and theoretically challenge this assumption on the optimal noise. We show that deviating from this assumption can actually lead to better statistical estimators, in terms of asymptotic variance. In particular, the optimal noise distribution is different from the data's and can even belong to a different family.
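For context, a minimal sketch of NCE itself: an unnormalized model is fit by logistic regression that discriminates data from noise samples, with the log-partition function absorbed into a free parameter. The 1-D Gaussian model, the noise scale, and the optimizer below are illustrative assumptions; the abstract's point is precisely that the choice of noise matters.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
x_data = rng.normal(loc=1.0, scale=0.5, size=2000)

# Noise distribution: a free design choice; the paper's point is that
# matching it to the data is not necessarily optimal.
noise = norm(loc=0.0, scale=2.0)
x_noise = noise.rvs(size=2000, random_state=rng)

def log_model(x, theta):
    # Unnormalized model: theta = (mu, log_sigma, c); c is a free parameter
    # standing in for the unknown negative log-partition function.
    mu, log_sigma, c = theta
    return -0.5 * ((x - mu) / np.exp(log_sigma)) ** 2 + c

def nce_loss(theta):
    # Logistic regression of data (label 1) vs. noise (label 0) with
    # log-odds G(u) = log p_model(u) - log p_noise(u).
    g_data = log_model(x_data, theta) - noise.logpdf(x_data)
    g_noise = log_model(x_noise, theta) - noise.logpdf(x_noise)
    # -E[log sigmoid(g_data)] - E[log(1 - sigmoid(g_noise))], stably:
    return np.logaddexp(0, -g_data).mean() + np.logaddexp(0, g_noise).mean()

theta_hat = minimize(nce_loss, x0=np.zeros(3)).x
print(theta_hat[:2])  # approx (1.0, log 0.5); theta_hat[2] absorbs log Z
```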
Sparse Linear Identifiable Multivariate Modeling
In this paper we consider sparse and identifiable linear latent variable
(factor) and linear Bayesian network models for parsimonious analysis of
multivariate data. We propose a computationally efficient method for joint
parameter and model inference, and model comparison. It consists of a fully
Bayesian hierarchy for sparse models using slab and spike priors (two-component
delta-function and continuous mixtures), non-Gaussian latent factors and a
stochastic search over the ordering of the variables. The framework, which we
call SLIM (Sparse Linear Identifiable Multivariate modeling), is validated and
benchmarked on artificial and real biological data sets. SLIM is closest in
spirit to LiNGAM (Shimizu et al., 2006), but differs substantially in
inference, Bayesian network structure learning and model comparison.
Experimentally, SLIM performs equally well or better than LiNGAM with
comparable computational complexity. We attribute this mainly to the stochastic
search strategy used, and to parsimony (sparsity and identifiability), which is
an explicit part of the model. We propose two extensions to the basic i.i.d.
linear framework: non-linear dependence on observed variables, called SNIM
(Sparse Non-linear Identifiable Multivariate modeling) and allowing for
correlations between latent variables, called CSLIM (Correlated SLIM), for
temporal and/or spatial data. The source code and scripts are available from
http://cogsys.imm.dtu.dk/slim/.
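A toy sketch of the model class SLIM operates on, a sparse linear SEM with a spike-and-slab edge structure and non-Gaussian noise; the dimensions, prior parameters, and sampling scheme below are illustrative assumptions, not SLIM's inference code.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 1000

# Spike-and-slab draw of a strictly lower-triangular weight matrix B:
# a Bernoulli "spike" decides whether an edge exists, a continuous
# "slab" draws its weight (Gaussian here, as an illustrative choice).
mask = np.tril(rng.random((d, d)) < 0.4, k=-1)  # sparse edge pattern
B = mask * rng.normal(scale=1.0, size=(d, d))

# Non-Gaussian (Laplace) noise makes the linear model identifiable, as in
# LiNGAM (Shimizu et al., 2006): x = B x + e  =>  x = (I - B)^{-1} e.
e = rng.laplace(size=(d, n))
x = np.linalg.solve(np.eye(d) - B, e)           # one sample per column
```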